Augmented Partial Mutual Learning with Frame Masking for Video Captioning
Authors
Abstract
Recent video captioning work has improved greatly thanks to the invention of various elaborate model architectures. If multiple models are combined into a unified framework, not by simple ensembling but so that each model can benefit from the others, the final performance might be boosted further. Jointly training multiple models has been explored in previous works. In this paper, we propose a novel Augmented Partial Mutual Learning (APML) method in which multiple decoders are trained jointly with mimicry losses between different input variations. Another problem in video captioning is the "one-to-many" mapping, which means that one identical video input is mapped to multiple caption annotations. To address this problem, we propose an annotation-wise frame masking approach that converts the "one-to-many" mapping into a "one-to-one" mapping. Experiments performed on the MSR-VTT and MSVD datasets demonstrate that our proposed algorithm achieves state-of-the-art performance.
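The two ideas in the abstract can be sketched in code. The following is a minimal PyTorch illustration, not the paper's actual implementation: `mimicry_loss` is a symmetric KL divergence between two decoders' token distributions (a standard formulation of mutual-learning mimicry), and `annotation_wise_mask` deterministically masks a different subset of frames per caption annotation so that each (masked video, caption) pair becomes a one-to-one mapping. All function names, signatures, and the `keep_ratio` parameter are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def mimicry_loss(logits_a, logits_b, temperature=1.0):
    """Symmetric KL mimicry loss between two decoders.
    logits_a, logits_b: (batch, seq_len, vocab) raw decoder outputs.
    Each decoder is pushed toward the other's predictive distribution."""
    log_p_a = F.log_softmax(logits_a / temperature, dim=-1)
    log_p_b = F.log_softmax(logits_b / temperature, dim=-1)
    # F.kl_div expects log-probabilities as input and probabilities as target
    kl_ab = F.kl_div(log_p_a, log_p_b.exp(), reduction="batchmean")
    kl_ba = F.kl_div(log_p_b, log_p_a.exp(), reduction="batchmean")
    return 0.5 * (kl_ab + kl_ba)

def annotation_wise_mask(frames, annotation_id, keep_ratio=0.8, seed=0):
    """Mask a different, reproducible subset of frames for each caption
    annotation of the same video (hypothetical masking scheme).
    frames: (num_frames, feat_dim). Returns (masked_frames, boolean_mask)."""
    g = torch.Generator().manual_seed(seed + annotation_id)
    num_frames = frames.shape[0]
    keep = max(1, int(round(keep_ratio * num_frames)))
    idx = torch.randperm(num_frames, generator=g)[:keep]
    mask = torch.zeros(num_frames, dtype=torch.bool)
    mask[idx] = True
    return frames * mask.unsqueeze(-1).float(), mask
```

In a joint training loop, each decoder's total loss would combine its own captioning cross-entropy with the mimicry term against the other decoders, while each training pair uses the frame subset tied to its annotation index.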
Similar References
Deep Learning for Video Classification and Captioning
Accelerated by the tremendous increase in Internet bandwidth and storage space, video data has been generated, published and spread explosively, becoming an indispensable part of today's big data. In this paper, we focus on reviewing two lines of research aiming to stimulate the comprehension of videos with deep learning: video classification and video captioning. While video classification con...
Video captioning with recurrent networks based on frame- and video-level features and visual content classification
In this paper, we describe the system for generating textual descriptions of short video clips using recurrent neural networks, which we used while participating in the Large Scale Movie Description Challenge 2015 in ICCV 2015. Our work builds on static image captioning system proposed in [22] and implemented in [8] and extends this framework to videos utilizing both static image features and v...
Video Captioning via Hierarchical Reinforcement Learning
Video captioning is the task of automatically generating a textual description of the actions in a video. Although previous work (e.g. sequence-to-sequence model) has shown promising results in abstracting a coarse description of a short video, it is still very challenging to caption a video containing multiple fine-grained actions with a detailed description. This paper aims to address the cha...
End-to-End Video Captioning with Multitask Reinforcement Learning
Although end-to-end (E2E) learning has led to promising performance on a variety of tasks, it is often impeded by hardware constraints (e.g., GPU memories) and is prone to overfitting. When it comes to video captioning, one of the most challenging benchmark tasks in computer vision and machine learning, those limitations of E2E learning are especially amplified by the fact that both the input v...
Reconstruction Network for Video Captioning
In this paper, the problem of describing visual contents of a video sequence with natural language is addressed. Unlike previous video captioning work mainly exploiting the cues of video contents to make a language description, we propose a reconstruction network (RecNet) with a novel encoder-decoder-reconstructor architecture, which leverages both the forward (video to sentence) and backward (...
Journal
Journal title: Proceedings of the ... AAAI Conference on Artificial Intelligence
Year: 2021
ISSN: 2159-5399, 2374-3468
DOI: https://doi.org/10.1609/aaai.v35i3.16301